Project 3

Option C: Use collaborative filtering to build a custom recommendation system

Member 1: Daniel Rodriguez-Gonzalez

Member 2: Darrel Pyle

Member 3: Josh Ruiz

Models

Model 1:

We will use the GroupLens MovieLens dataset in conjunction with data from the Internet Movie Database (IMDb) to build a recommendation system that combines actors, movie certification, and IMDb ratings with user ratings from MovieLens. We will test whether the additional IMDb data improves the recommender's performance.

The imdbpie library was used to connect to IMDb's API and download data.

Model 2:

We also tested a recommendation system that recommends similar actors based on their performances in particular movie genres. The target in this case is still the users' ratings.

Model Evaluation:

The performance of each model will be evaluated via a precision-recall curve. The model that scores higher on both measures is preferred. An 80/20 train/test split is used to train and test the models and to generate the precision-recall curves.
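As an illustrative sketch (toy data only, not the project's GraphLab evaluation), precision@k and recall@k, the points that trace out a precision-recall curve for a recommender, can be computed like this; the movie titles and held-out set below are hypothetical:

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k: fraction of the top-k recommendations that are relevant.
    Recall@k: fraction of the relevant items recovered in the top k."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(k), hits / float(len(relevant))

# Hypothetical ranked recommendations and a user's held-out test items
recommended = ['Casino', 'Se7en', 'Powder', 'Jumanji', 'Toy Story']
relevant = ['Casino', 'Toy Story']

for k in (1, 3, 5):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

Sweeping k and averaging over users gives the precision-recall pairs that are plotted as the curve.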

Business Cases:

Model 1:

In this case, we wish to build a model that can be used by a streaming service (Netflix, AppleTV, Amazon Prime) to recommend movies to users based on their ratings of other movies.

Deployment:

A company would initially run the model on its user rating data and present recommended movies to each user. As users enter more ratings, the model can be refreshed so that the company offers better recommendations to its users.

Additional Data:

Gathering Twitter streams for sentiment analysis on movies could also provide a measure of each movie's success. In this analysis we incorporated side data for the items; we would also like to incorporate side data on the users, such as age, gender, geographic location, and socio-economic status.

Model 2:

In this case, a casting agent has a particular actor in mind for a role but would also like recommendations for actors with similar ratings in the genre space of the movie being made. This could provide an unbiased approach to selecting actors based on how users rated their past performances.

Deployment:

This could be a tool that studios provide to their casting managers, delivered as an app that is updated as more user ratings become available and more movies are made.

Additional Data:

Gross ticket sales by actor could help in determining which actor can generate the most revenue for a movie. Actor salaries would also be beneficial in balancing movie budgets. A recommendation system could be developed to help structure a movie financially.

Conclusions Summary:

Model 1:

It was surprising that the best model was an item_similarity model, despite our downloading extra data for a user-item model. The additional item (movie) side data did improve the models that could use it over those trained without it, but the item-item models, which ignore side data, always scored higher.

Model 2:

In some cases the recommendations make a lot of sense; for example, Bruce Willis is paired with Jason Statham. However, the models overall do not show high precision or recall. We believe this model can be improved with more data and analysis.

Importing Required Libraries and Datasets

In [1]:
# Import libraries needed for processing and visualizations
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

From the README.txt file included with the MovieLens dataset:

Summary

This dataset describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users between April 03, 1996 and January 09, 2016. This dataset was generated on January 11, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, links.csv, movies.csv, ratings.csv and tags.csv.
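The 20-ratings minimum per user is easy to verify once ratings.csv is loaded; a sketch with toy data (the small hand-built df_ratings below stands in for the real file, and the same check runs unchanged on it):

```python
import pandas as pd

# Toy stand-in for ratings.csv with three users
df_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [16, 24, 32, 16, 50, 47],
    'rating':  [4.0, 1.5, 4.0, 3.0, 5.0, 4.0],
})

# Count ratings per user; on the full MovieLens data the minimum should be >= 20
ratings_per_user = df_ratings.groupby('userId')['movieId'].count()
print(ratings_per_user.min())
```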

In [2]:
#Importing dataset from MovieLens
df_links = pd.read_csv('links.csv')
df_tags = pd.read_csv('tags.csv')
df_ratings = pd.read_csv('ratings.csv')
df_movies = pd.read_csv('movies.csv')

Data Understanding

The links file contains 10,329 rows and 3 columns. The movieId is a unique identifier for movies. The imdbId identifies movies on the IMDb website, and the tmdbId column identifies movies on The Movie Database (TMDb) site.

Each row in the links file is a unique movie.

In [3]:
print (df_links.shape)
print (df_links.movieId.unique().shape)
df_links.head()
(10329, 3)
(10329,)
Out[3]:
movieId imdbId tmdbId
0 1 114709 862.0
1 2 113497 8844.0
2 3 113228 15602.0
3 4 114885 31357.0
4 5 113041 11862.0

Movies File

The movies file also contains 10,329 unique rows, each identifying a movie. The other columns detail the title of the movie and the genres for the movie.

In [4]:
print (df_movies.shape)
print (df_movies.movieId.unique().shape)
df_movies.head()
(10329, 3)
(10329,)
Out[4]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

Ratings File

The ratings file contains 4 columns and 105,339 entries.

Only 10,325 unique movies are represented in the ratings, from 668 unique users.

In [5]:
print (df_ratings.shape)
print (df_ratings.movieId.unique().shape)
print (df_ratings.userId.unique().shape)
df_ratings.head()
(105339, 4)
(10325,)
(668,)
Out[5]:
userId movieId rating timestamp
0 1 16 4.0 1217897793
1 1 24 1.5 1217895807
2 1 32 4.0 1217896246
3 1 47 4.0 1217896556
4 1 50 4.0 1217896523

As the histogram below of all ratings in ratings.csv shows, ratings are entered in increments of 0.5 and the distribution is left-skewed.

In [6]:
df_ratings.rating.hist(bins = 40)
plt.title('Ratings Histogram')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

Grouping by userId and movieId also confirms that each (user, movie) pair appears exactly once: every rating count in the grouped output below is 1.

In [7]:
print (df_ratings.groupby(['userId','movieId'])['rating'].count().unique())
df_ratings.groupby(['userId','movieId']).count().head(10)
[1]
Out[7]:
rating timestamp
userId movieId
1 16 1 1
24 1 1
32 1 1
47 1 1
50 1 1
110 1 1
150 1 1
161 1 1
165 1 1
204 1 1

Tags File

Each row of the tags file records a textual tag that a user assigned to a particular movie.

There are 6,138 tags in the dataframe.

In [8]:
print (df_tags.shape)
df_tags.head()
(6138, 4)
Out[8]:
userId movieId tag timestamp
0 12 16 20060407 1144396544
1 12 16 robert de niro 1144396554
2 12 16 scorcese 1144396564
3 17 64116 movie to see 1234720092
4 21 260 action 1428011080

If we group by userId, movieId, and tag, we can see that users have tagged several movies and some movies are tagged more than once.

In [9]:
df_tags.groupby(['userId','movieId','tag']).count().head(10)
Out[9]:
timestamp
userId movieId tag
12 16 20060407 1
robert de niro 1
scorcese 1
17 64116 movie to see 1
21 260 action 1
politics 1
science fiction 1
296 dark humor 1
drugs 1
philosophical 1

Data Preparation

imdbpie

Given the above datasets, we can make use of the imdbId in the links file to gather more information on movies from the IMDb movie site.

We installed the imdbpie library to help us establish a connection with the IMDb API.

In [10]:
#Install imdbpie via pip package manager
#!pip install imdbpie

from imdbpie import Imdb

imdb = Imdb()
imdb = Imdb(anonymize=True) # to proxy requests

# Creating an instance with caching enabled
# Note that the cached responses expire every 2 hours or so.
# The API response itself dictates the expiry time)
imdb = Imdb(cache=True)

If we perform a left join between the links and movies files on movieId, we can associate each movie with its imdbId and title, as shown below.

In [11]:
df_Link_Movie_join = pd.merge(df_links, df_movies, how='left',on='movieId')

print (df_Link_Movie_join.shape)
df_Link_Movie_join.head()
(10329, 5)
Out[11]:
movieId imdbId tmdbId title genres
0 1 114709 862.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 113497 8844.0 Jumanji (1995) Adventure|Children|Fantasy
2 3 113228 15602.0 Grumpier Old Men (1995) Comedy|Romance
3 4 114885 31357.0 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 113041 11862.0 Father of the Bride Part II (1995) Comedy

Furthermore, if we perform another left join between our ratings file and the joined links & movies file from above, we get a dataframe that incorporates userId, movieId, rating, imdbId, title, and genres for all movies that are rated by a user.

In [12]:
df_Ratings_Link_Movies_join = pd.merge(df_ratings, df_Link_Movie_join, how='left',on='movieId')

print (df_Ratings_Link_Movies_join.shape)
df_Ratings_Link_Movies_join.head()
(105339, 8)
Out[12]:
userId movieId rating timestamp imdbId tmdbId title genres
0 1 16 4.0 1217897793 112641 524.0 Casino (1995) Crime|Drama
1 1 24 1.5 1217895807 114168 12665.0 Powder (1995) Drama|Sci-Fi
2 1 32 4.0 1217896246 114746 63.0 Twelve Monkeys (a.k.a. 12 Monkeys) (1995) Mystery|Sci-Fi|Thriller
3 1 47 4.0 1217896556 114369 807.0 Seven (a.k.a. Se7en) (1995) Mystery|Thriller
4 1 50 4.0 1217896523 114814 629.0 Usual Suspects, The (1995) Crime|Mystery|Thriller

To better understand the IMDb API, we will query a movie by its imdbId.

In this case, we take the imdbId of Casino (1995) to verify how the imdbId works and to observe what information can be pulled from the API.

Note that the imdbId requires zero padding to seven digits and a 'tt' prefix to form a valid query.
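For reference, a minimal sketch of building that query string from the numeric imdbId stored in links.csv:

```python
imdb_id = 112641  # Casino (1995), as stored in links.csv

# Zero-pad the numeric id to 7 digits and add the 'tt' prefix
query_id = "tt" + str(imdb_id).zfill(7)
print(query_id)  # tt0112641
```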

In [13]:
title = imdb.get_title_by_id("tt0112641")

print ('Title:',title.title)
print ('Rating:',title.rating)
print ('Certification:',title.certification)
print ('First Entry in Cast List:',title.cast_summary[0].name)
print ('First Entry in Cast List ID:',title.cast_summary[0].imdb_id)
print ('Length of the Cast List:',len(title.cast_summary))
('Title:', u'Casino')
('Rating:', 8.2)
('Certification:', u'R')
('First Entry in Cast List:', u'Robert De Niro')
('First Entry in Cast List ID:', u'nm0000134')
('Length of the Cast List:', 4)

Download additional data from the IMDb API

For our recommendation engine, we will incorporate the imdb_rating, certification (PG, PG-13, R, etc.), and the top 4 actors in the movie as features.

The commented code below was used to download the above information for each movie. Since there are 10,329 movies, we split the downloads amongst ourselves, as it would have taken more than 4.5 hours to download everything from one computer.

While downloading the data, we encountered several errors when a movie did not exist or actor information was unavailable; the code below was adjusted accordingly as we continued the download process.
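A more general alternative to hard-coding the failing row indices would be to wrap each request in try/except. A sketch under that assumption, with fetch_title standing in for the real imdb.get_title_by_id call and a made-up in-memory catalog:

```python
# Sketch: skip titles whose lookup fails instead of hard-coding their indices.
# fetch_title is a stand-in for imdb.get_title_by_id, not the real API.

def fetch_title(imdb_id):
    catalog = {112641: {'rating': 8.2, 'certification': 'R'}}
    if imdb_id not in catalog:
        raise LookupError('title not found')
    return catalog[imdb_id]

results = {}
for imdb_id in (112641, 9999999):
    try:
        results[imdb_id] = fetch_title(imdb_id)
    except LookupError:
        results[imdb_id] = None  # row stays empty and is dropped later by dropna
```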

In [14]:
# Used only for preprocessing, commented out for final runs

# df_links['imdb_rating'] = np.nan
# df_links['cert'] = np.nan
# df_links['Actor_0'] = np.nan
# df_links['Actor_1'] = np.nan
# df_links['Actor_2'] = np.nan
# df_links['Actor_3'] = np.nan
In [15]:
# Download ranges
# Daniel [0,3300]
# Josh [3300, 6600]
# Darrel [6600, 10329]
# Used only for preprocessing, commented out for final runs

# for i in range(0,10329):
#     movie = df_links['imdbId'][i]

#     title = imdb.get_title_by_id("tt"+str(movie).zfill(7))
#     print (i, movie, title)   
#     if (i == 8030 or i == 8659 or i == 9753 or i == 10047 or i == 10328) :
#         df_links['imdb_rating'].iloc[i] = np.nan
#         df_links['cert'].iloc[i] = np.nan
#         df_links['Actor_0'].iloc[i] = np.nan
#         df_links['Actor_1'].iloc[i] = np.nan
#         df_links['Actor_2'].iloc[i] = np.nan
#         df_links['Actor_3'].iloc[i] = np.nan        
#     else:
#         df_links['imdb_rating'].iloc[i] = title.rating
#         df_links['cert'].iloc[i] = title.certification
#         for j in range(0,len(title.cast_summary)):
#             df_links['Actor_'+str(j)].iloc[i] = title.cast_summary[j].name
In [16]:
# Used only for preprocessing, commented out for final runs

# df_links.to_csv('df_links_imdb_0_3300.csv',encoding = 'utf-8')

The code below concatenates our 3 download files from IMDb into 1.

In [17]:
# Daniel df_links_imdb_0_3300
df_imdb_1 = pd.read_csv('df_links_imdb_0_3300.csv')
print 'Set 1 Before:', df_imdb_1.shape

df_imdb_1.dropna(thresh=6, inplace=True)
print 'Set 1 After:', df_imdb_1.shape
Set 1 Before: (10329, 10)
Set 1 After: (3296, 10)
In [18]:
# Josh df_links_imdb_3300_6600.csv
df_imdb_2 = pd.read_csv('df_links_imdb_3300_6600.csv')
print 'Set 2 Before:', df_imdb_2.shape

df_imdb_2.dropna(thresh=6, inplace=True)
print 'Set 2 After:', df_imdb_2.shape
Set 2 Before: (10329, 10)
Set 2 After: (3293, 10)
In [19]:
# Darrel df_links_imdb_6600_End.csv
df_imdb_3 = pd.read_csv('df_links_imdb_6600_End.csv')
print 'Set 3 Before:', df_imdb_3.shape

df_imdb_3.dropna(thresh=6, inplace=True)
print 'Set 3 After:', df_imdb_3.shape
Set 3 Before: (10329, 10)
Set 3 After: (3719, 10)
In [20]:
#Concatenating all 3 files into 1
df_imdb = df_imdb_1.append([df_imdb_2, df_imdb_3])

# Delete unused dataframes to reduce memory usage and avoid confusion
del df_imdb_1
del df_imdb_2
del df_imdb_3

#Since we used the df_links dataframe to query from the API, we no longer need the df_links information
#We will keep the movieID (key), and data specifically from IMDb 
df_imdb = df_imdb[['movieId','imdb_rating','cert','Actor_0','Actor_1','Actor_2','Actor_3']]
print (df_imdb.shape)
df_imdb.head()
(10308, 7)
Out[20]:
movieId imdb_rating cert Actor_0 Actor_1 Actor_2 Actor_3
0 1 8.3 TV-G Tom Hanks Tim Allen Don Rickles Jim Varney
1 2 6.9 PG Robin Williams Kirsten Dunst Bonnie Hunt Jonathan Hyde
2 3 6.6 PG-13 Walter Matthau Jack Lemmon Ann-Margret Sophia Loren
3 4 5.6 R Whitney Houston Angela Bassett Loretta Devine Lela Rochon
4 5 5.9 PG Steve Martin Diane Keaton Martin Short Kimberly Williams-Paisley

Below, we confirm that no duplicate movieId values are present in df_imdb.

In [21]:
# review results to ensure no duplicate movieId values exist
# no rows will be returned if there are no duplicate values

groups = df_imdb.groupby(by=['movieId'])
groups.filter(lambda x: len(x) > 1).sort_values(by='movieId')
Out[21]:
movieId imdb_rating cert Actor_0 Actor_1 Actor_2 Actor_3

Finalizing a Dataframe for Modeling

First, we perform another left join between our dataframe containing ratings, links, and movies and our new IMDb data.

In [22]:
df_Ratings_Link_Movies_imdb_join = pd.merge(df_Ratings_Link_Movies_join, df_imdb, how='left',on='movieId')

print (df_Ratings_Link_Movies_imdb_join.shape)
df_Ratings_Link_Movies_imdb_join.head()
(105339, 14)
Out[22]:
userId movieId rating timestamp imdbId tmdbId title genres imdb_rating cert Actor_0 Actor_1 Actor_2 Actor_3
0 1 16 4.0 1217897793 112641 524.0 Casino (1995) Crime|Drama 8.2 R Robert De Niro Sharon Stone Joe Pesci James Woods
1 1 24 1.5 1217895807 114168 12665.0 Powder (1995) Drama|Sci-Fi 6.5 PG-13 Mary Steenburgen Sean Patrick Flanery Lance Henriksen Jeff Goldblum
2 1 32 4.0 1217896246 114746 63.0 Twelve Monkeys (a.k.a. 12 Monkeys) (1995) Mystery|Sci-Fi|Thriller 8.1 R Bruce Willis Madeleine Stowe Brad Pitt Joseph Melito
3 1 47 4.0 1217896556 114369 807.0 Seven (a.k.a. Se7en) (1995) Mystery|Thriller 8.6 R Morgan Freeman Brad Pitt Kevin Spacey Andrew Kevin Walker
4 1 50 4.0 1217896523 114814 629.0 Usual Suspects, The (1995) Crime|Mystery|Thriller 8.6 R Kevin Spacey Gabriel Byrne Chazz Palminteri Stephen Baldwin

We will now rename our dataframe and remove the timestamp, imdbId, and tmdbId columns because our recommendations will not be based on time or IDs.

In [23]:
df = df_Ratings_Link_Movies_imdb_join
df = df[['userId','movieId','rating','title','genres','imdb_rating','cert','Actor_0','Actor_1','Actor_2','Actor_3']]
df.head()
Out[23]:
userId movieId rating title genres imdb_rating cert Actor_0 Actor_1 Actor_2 Actor_3
0 1 16 4.0 Casino (1995) Crime|Drama 8.2 R Robert De Niro Sharon Stone Joe Pesci James Woods
1 1 24 1.5 Powder (1995) Drama|Sci-Fi 6.5 PG-13 Mary Steenburgen Sean Patrick Flanery Lance Henriksen Jeff Goldblum
2 1 32 4.0 Twelve Monkeys (a.k.a. 12 Monkeys) (1995) Mystery|Sci-Fi|Thriller 8.1 R Bruce Willis Madeleine Stowe Brad Pitt Joseph Melito
3 1 47 4.0 Seven (a.k.a. Se7en) (1995) Mystery|Thriller 8.6 R Morgan Freeman Brad Pitt Kevin Spacey Andrew Kevin Walker
4 1 50 4.0 Usual Suspects, The (1995) Crime|Mystery|Thriller 8.6 R Kevin Spacey Gabriel Byrne Chazz Palminteri Stephen Baldwin

Exploratory Data Analysis

The information below gives us an idea of how the data is distributed in the dataset.

In [24]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 105339 entries, 0 to 105338
Data columns (total 11 columns):
userId         105339 non-null int64
movieId        105339 non-null int64
rating         105339 non-null float64
title          105339 non-null object
genres         105339 non-null object
imdb_rating    105240 non-null float64
cert           105120 non-null object
Actor_0        105228 non-null object
Actor_1        105011 non-null object
Actor_2        104878 non-null object
Actor_3        104846 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 9.6+ MB
In [25]:
df.rating.hist(bins = 60)
plt.title('MovieLens Ratings Histogram')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.xlim([0,5])
plt.show()

df.imdb_rating.hist(bins = 60)
plt.title('IMDb Ratings Histogram')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.xlim([0,10])
plt.show()
In [26]:
df.groupby('cert').count().sort_values('movieId', ascending=False).head(10)
Out[26]:
userId movieId rating title genres imdb_rating Actor_0 Actor_1 Actor_2 Actor_3
cert
R 42570 42570 42570 42570 42570 42570 42570 42570 42567 42554
PG-13 26224 26224 26224 26224 26224 26224 26224 26219 26218 26218
PG 16466 16466 16466 16466 16466 16466 16466 16456 16453 16451
G 3721 3721 3721 3721 3721 3721 3721 3717 3699 3699
Approved 2955 2955 2955 2955 2955 2955 2955 2955 2953 2953
Not Rated 2507 2507 2507 2507 2507 2507 2497 2493 2487 2478
TV-14 2138 2138 2138 2138 2138 2138 2138 2138 2138 2138
Unrated 2128 2128 2128 2128 2128 2128 2127 2005 1929 1929
TV-PG 1873 1873 1873 1873 1873 1873 1873 1873 1873 1872
TV-MA 1826 1826 1826 1826 1826 1826 1826 1807 1801 1801
In [27]:
df.groupby('userId').count().sort_values('movieId', ascending=False).head(10)
Out[27]:
movieId rating title genres imdb_rating cert Actor_0 Actor_1 Actor_2 Actor_3
userId
668 5678 5678 5678 5678 5669 5644 5667 5660 5651 5644
575 2837 2837 2837 2837 2833 2825 2833 2829 2824 2822
458 2086 2086 2086 2086 2085 2083 2085 2081 2078 2076
232 1421 1421 1421 1421 1421 1418 1420 1412 1408 1407
310 1287 1287 1287 1287 1286 1286 1285 1285 1283 1283
475 1249 1249 1249 1249 1247 1236 1247 1246 1244 1242
128 1231 1231 1231 1231 1229 1227 1229 1226 1220 1219
224 1182 1182 1182 1182 1181 1177 1180 1174 1169 1169
607 1176 1176 1176 1176 1175 1174 1174 1172 1170 1170
63 1107 1107 1107 1107 1107 1106 1107 1105 1103 1102

The code below is commented out because it sometimes causes the kernel to freeze.

In [28]:
#df.groupby('genres').count().sort_values('movieId', ascending=False).head(10)
In [29]:
#df.groupby('Actor_0').count().sort_values('movieId', ascending=False).head(10)
In [30]:
#df.groupby('Actor_1').count().sort_values('movieId', ascending=False).head(10)
In [31]:
#df.groupby('Actor_2').count().sort_values('movieId', ascending=False).head(10)
In [32]:
#df.groupby('Actor_3').count().sort_values('movieId', ascending=False).head(10)

Modeling

From the GraphLab documentation: The user id and item id columns must be of type ‘int’ or ‘str’. The target column must be of type ‘int’ or ‘float’.

This is verified below. Also, once we drop rows containing null values, GraphLab will not throw an error.

In [33]:
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 104759 entries, 0 to 105338
Data columns (total 11 columns):
userId         104759 non-null int64
movieId        104759 non-null int64
rating         104759 non-null float64
title          104759 non-null object
genres         104759 non-null object
imdb_rating    104759 non-null float64
cert           104759 non-null object
Actor_0        104759 non-null object
Actor_1        104759 non-null object
Actor_2        104759 non-null object
Actor_3        104759 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 9.6+ MB

Understanding GraphLab Recommenders

From GraphLab's Machine Learning >> Recommender page, five types of recommenders are supported:

  • item similarity models: item_similarity_recommender

    • A recommender that uses item-item similarities based on users in common
      • Obstacle: Documentation states: Side information for users and items is currently ignored by this model!
  • item content recommenders: item_content_recommender

    • A content-based recommender model in which the similarity between the items recommended is determined by the content of those items rather than learned from user interaction data. The similarity score between two items is calculated by first computing the similarity between the item data for each column, then taking a weighted average of the per-column similarities to get the final similarity. The recommendations are generated according to the average similarity of a candidate item to all the items in a user’s set of rated items.
      • WARNING: The ItemContentRecommender model is still in beta.
  • factorization recommenders: factorization_recommender

    • A Factorization-based recommender that learns latent factors for each user and item and uses them to make rating predictions. This includes both standard matrix factorization as well as factorization machines models (in the situation where side data is available for users and/or items).

    • Supports side_data_factorization: Use factorization for modeling any additional features beyond the user and item columns. If True, and side features or any additional columns are present, then a Factorization Machine model is trained. Otherwise, only the linear terms are fit to these features. Default: True.

  • factorization recommenders for ranking: ranking_factorization_recommender

    • A Ranking Factorization Recommender learns latent factors for each user and item and uses them to make rating predictions.
    • The main difference between this and the factorization_recommender is the ranking_regularization parameter:
      • Penalizes the predicted value of user-item pairs not in the training set. Larger values increase this penalization. Suggested values: 0, 0.1, 0.5, 1. NOTE: if no target column is present, this parameter is ignored.
    • Supports side_data_factorization.
  • popularity-based recommenders: popularity_recommender

    • A model that makes recommendations using item popularity. When no target column is provided, the popularity is determined by the number of observations involving each item. When a target is provided, popularity is computed using the item’s mean target value. When the target column contains ratings, for example, the model computes the mean rating for each item and uses this to rank items for recommendations.
    • In this case, our target is the users' MovieLens ratings
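To make the first option concrete: the core of an item-similarity recommender is a similarity score between item rating columns. A toy sketch with made-up numbers (GraphLab's item_similarity_recommender does this at scale and offers several similarity metrics; plain cosine similarity is used here for illustration):

```python
import numpy as np

# Toy user-item rating matrix: rows are users, columns are items, 0 = unrated.
R = np.array([[4.0, 5.0, 0.0],
              [3.0, 4.0, 1.0],
              [0.0, 2.0, 5.0]])

def cosine(a, b):
    """Cosine similarity between two item rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Items 0 and 1 are rated similarly by the same users, so their similarity
# is higher than that of items 0 and 2.
print(cosine(R[:, 0], R[:, 1]), cosine(R[:, 0], R[:, 2]))
```

Items with high similarity to what a user already rated highly become that user's recommendations.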

Model Evaluation

Model 1: Movie Recommendation System

We wish to evaluate several types of recommenders for our movie recommendation model.

Specifically:

  • item similarity
  • factorization recommender
  • ranking factorization recommender
  • popularity-based

The performance of each model will be evaluated via a precision-recall curve. The model that scores highest on both measures is preferred.

We define the following variables below:

  • data: the entire MovieLens + IMDb dataset
  • item_data: Additional movie characteristics to be included in the recommender
  • train, test: a single 80/20 train test split to evaluate models
In [34]:
import graphlab as gl
gl.canvas.set_target('ipynb')

#Creating an SFrame with our movie data, df
data = gl.SFrame(data=df)

#Defining the side information for our movies/items
item_data = data[['title','genres','imdb_rating','cert','Actor_0','Actor_1','Actor_2','Actor_3']]

# Split the data into a single training and test set
train, test = gl.recommender.util.random_split_by_user(data,
                                                       user_id="userId", 
                                                       item_id="title",
                                                       max_num_users=None, #None: use all available users for test set
                                                       item_test_proportion=0.2) #80/20 train/test split
This non-commercial license of GraphLab Create for academic use is assigned to drodriguezgo@smu.edu and will expire on June 06, 2017.
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1471552520.log

graphlab.recommender.create

graphlab.recommender.create is a unified interface for training recommender models. Based on simple characteristics of the data, a model type is selected and trained. The trained model can be used to predict ratings and make recommendations.

First, we will build two models based on the standard recommender by GraphLab:

  • graphlab.recommender.create - without item_data included in the model
  • graphlab.recommender.create - with item_data included in the model

graphlab.recommender.create - No Side Data Included

For this data, recommender.create selects a ranking factorization model.

In [61]:
#Train a model based on characteristics of the data, a ranking_factorization_recommender is selected
user_item_rec_create_noside = gl.recommender.create(train, 
                                user_id="userId", 
                                item_id="title", 
                                target="rating")
Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 83791 observations with 668 users and 9380 items.
    Data prepared in: 0.569409s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
| max_iterations                 | Maximum Number of Iterations                     | 25       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 10473 / 83791 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 5                 | Not Viable                               |
| 1       | 1.25              | Not Viable                               |
| 2       | 0.3125            | Not Viable                               |
| 3       | 0.078125          | Not Viable                               |
| 4       | 0.0195312         | 1.00496                                  |
| 5       | 0.00976562        | 1.31091                                  |
| 6       | 0.00488281        | 1.4643                                   |
| 7       | 0.00244141        | 1.65076                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.0195312         | 1.00496                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 87us         | 2.13545           | 1.04367               |             |
+---------+--------------+-------------------+-----------------------+-------------+
| 1       | 447.52ms     | 1.62334           | 0.970261              | 0.0195312   |
| 2       | 857.176ms    | 1.35394           | 0.908603              | 0.0195312   |
| 3       | 1.34s        | 1.21143           | 0.872592              | 0.0195312   |
| 4       | 1.94s        | 1.11663           | 0.848401              | 0.0195312   |
| 5       | 2.45s        | 1.04494           | 0.827782              | 0.0195312   |
| 6       | 2.84s        | 0.996501          | 0.811301              | 0.0195312   |
| 10      | 4.32s        | 0.861232          | 0.765821              | 0.0195312   |
| 11      | 4.70s        | 0.837391          | 0.756374              | 0.0195312   |
| 15      | 6.19s        | 0.762996          | 0.72681               | 0.0195312   |
| 20      | 8.02s        | 0.696519          | 0.698353              | 0.0195312   |
| 25      | 9.75s        | 0.647268          | 0.67475               | 0.0195312   |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 0.611829
       Final training RMSE: 0.648842

graphlab.recommender.create - Side Data Included

For this data, recommender.create again selects a ranking factorization model.

In [62]:
#Train a model based on characteristics of the data, a ranking_factorization_recommender is selected by default
user_item_rec_create_side = gl.recommender.create(train, 
                                user_id="userId", 
                                item_id="title", 
                                target="rating", 
                                item_data=item_data) #side data included
Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 83791 observations with 668 users and 10143 items.
    Data prepared in: 0.838758s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
| side_data_factorization        | Assign Factors for Side Data                     | True     |
| max_iterations                 | Maximum Number of Iterations                     | 25       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 10473 / 83791 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 2.94118           | Not Viable                               |
| 1       | 0.735294          | Not Viable                               |
| 2       | 0.183824          | Not Viable                               |
| 3       | 0.0459559         | 0.516296                                 |
| 4       | 0.0229779         | 0.627851                                 |
| 5       | 0.011489          | 0.829658                                 |
| 6       | 0.00574449        | 1.09745                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.0459559         | 0.516296                                 |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 90us         | 2.1355            | 1.04339               |             |
+---------+--------------+-------------------+-----------------------+-------------+
| 1       | 1.15s        | DIVERGED          | DIVERGED              | 0.0459559   |
| RESET   | 1.87s        | 2.13507           | 1.04335               |             |
| 1       | 3.12s        | 1.4621            | 1.00546               | 0.0229779   |
| 2       | 4.33s        | 1.03823           | 0.871441              | 0.0229779   |
| 3       | 5.30s        | 0.842806          | 0.801709              | 0.0229779   |
| 4       | 6.05s        | 0.721869          | 0.752209              | 0.0229779   |
| 5       | 6.91s        | 0.63203           | 0.711397              | 0.0229779   |
| 6       | 7.62s        | 0.566961          | 0.679458              | 0.0229779   |
| 9       | 9.93s        | 0.433563          | 0.604159              | 0.0229779   |
| 11      | 11.30s       | 0.376405          | 0.566094              | 0.0229779   |
| 14      | 13.20s       | 0.314781          | 0.521699              | 0.0229779   |
| 19      | 16.54s       | 0.252087          | 0.47041               | 0.0229779   |
| 24      | 19.31s       | 0.214348          | 0.435873              | 0.0229779   |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 0.188542
       Final training RMSE: 0.406837

factorization_recommender: No side data

Since the ranking factorization method was chosen above, we will also build two models, with and without side data, using a plain factorization recommender.

In [63]:
user_item_factor_noside = gl.factorization_recommender.create(train, 
                                user_id="userId", 
                                item_id="title", 
                                target="rating")
Recsys training: model = factorization_recommender
Preparing data set.
    Data has 83791 observations with 668 users and 9380 items.
    Data prepared in: 0.490065s
Training factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 8        |
| regularization                 | L2 Regularization on Factors                     | 1e-08    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
| max_iterations                 | Maximum Number of Iterations                     | 50       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 10473 / 83791 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 5                 | Not Viable                               |
| 1       | 1.25              | Not Viable                               |
| 2       | 0.3125            | Not Viable                               |
| 3       | 0.078125          | 0.188472                                 |
| 4       | 0.0390625         | 0.326477                                 |
| 5       | 0.0195312         | 0.539054                                 |
+---------+-------------------+------------------------------------------+
| Final   | 0.078125          | 0.188472                                 |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 148us        | 1.08925           | 1.04367               |             |
+---------+--------------+-------------------+-----------------------+-------------+
| 1       | 136.563ms    | 0.853257          | 0.923709              | 0.078125    |
| 2       | 234.873ms    | 0.634491          | 0.796538              | 0.078125    |
| 3       | 353.194ms    | 0.550162          | 0.741716              | 0.078125    |
| 4       | 446.254ms    | 0.50001           | 0.707099              | 0.078125    |
| 5       | 535.376ms    | 0.463896          | 0.681082              | 0.078125    |
| 6       | 638.948ms    | 0.433246          | 0.658195              | 0.078125    |
| 11      | 1.06s        | 0.360105          | 0.600062              | 0.078125    |
| 25      | 2.19s        | 0.305428          | 0.552621              | 0.078125    |
| 50      | 4.98s        | 0.279198          | 0.52835               | 0.078125    |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 0.258174
       Final training RMSE: 0.508065

factorization_recommender: Side Data

In [64]:
#Train a factorization_recommender, this time with item side data included
user_item_factor_side = gl.factorization_recommender.create(train, 
                                user_id="userId", 
                                item_id="title", 
                                target="rating", 
                                item_data=item_data) #side data included
Recsys training: model = factorization_recommender
Preparing data set.
    Data has 83791 observations with 668 users and 10143 items.
    Data prepared in: 0.814341s
Training factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 8        |
| regularization                 | L2 Regularization on Factors                     | 1e-08    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
| side_data_factorization        | Assign Factors for Side Data                     | True     |
| max_iterations                 | Maximum Number of Iterations                     | 50       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 10473 / 83791 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 2.94118           | Not Viable                               |
| 1       | 0.735294          | Not Viable                               |
| 2       | 0.183824          | Not Viable                               |
| 3       | 0.0459559         | 0.367195                                 |
| 4       | 0.0229779         | 0.434019                                 |
| 5       | 0.011489          | 0.510973                                 |
| 6       | 0.00574449        | 0.617132                                 |
+---------+-------------------+------------------------------------------+
| Final   | 0.0459559         | 0.367195                                 |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 81us         | 1.08857           | 1.04334               |             |
+---------+--------------+-------------------+-----------------------+-------------+
| 1       | 169.696ms    | 0.882051          | 0.93917               | 0.0459559   |
| 2       | 290.345ms    | 0.68608           | 0.828295              | 0.0459559   |
| 3       | 427.551ms    | 0.618408          | 0.786384              | 0.0459559   |
| 4       | 547.977ms    | 0.573039          | 0.756987              | 0.0459559   |
| 5       | 670.554ms    | 0.534543          | 0.731117              | 0.0459559   |
| 6       | 855.239ms    | 0.500347          | 0.707343              | 0.0459559   |
| 10      | 1.52s        | 0.415707          | 0.644741              | 0.0459559   |
| 11      | 1.65s        | 0.402497          | 0.634413              | 0.0459559   |
| 20      | 2.77s        | 0.342511          | 0.585226              | 0.0459559   |
| 30      | 4.09s        | 0.316487          | 0.56255               | 0.0459559   |
| 40      | 5.17s        | 0.302109          | 0.549621              | 0.0459559   |
| 50      | 6.36s        | 0.292949          | 0.541222              | 0.0459559   |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 0.270973
       Final training RMSE: 0.520524

item_similarity_recommender: No side data

Next, we will build an item similarity model. Currently, this model type does not support side data.

In [37]:
item_item_noside = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  similarity_type="cosine")
Recsys training: model = item_similarity
Warning: Ignoring columns movieId, genres, imdb_rating, cert, Actor_0, Actor_1, Actor_2, Actor_3;
    To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
    Data has 83791 observations with 668 users and 9380 items.
    Data prepared in: 0.165226s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 11.649ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 204.204ms                           | 0                | 1               |
| 1.80s                               | 100              | 9380            |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 1.82664s
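For reference, the cosine similarity used above compares two items via their user-rating vectors (each user contributes one coordinate, with 0 for unrated). A minimal sketch over a toy ratings matrix (not GraphLab's implementation):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two item rating vectors (0 if either is all-zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Rows = users, columns = items; 0.0 means unrated.
ratings = np.array([[5.0, 4.0, 0.0],
                    [3.0, 3.0, 1.0],
                    [0.0, 1.0, 5.0]])

# Items 0 and 1 are rated similarly by the same users, so they score high;
# item 2 is liked by a different user, so it scores low against item 0.
sim_01 = cosine_sim(ratings[:, 0], ratings[:, 1])
sim_02 = cosine_sim(ratings[:, 0], ratings[:, 2])
```

At recommendation time, an item-item model scores unseen items for a user by aggregating similarities to the items that user has already rated.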

popularity_recommender: No side data

In [65]:
popularity_noside = gl.recommender.popularity_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating")
Recsys training: model = popularity
Warning: Ignoring columns movieId, genres, imdb_rating, cert, Actor_0, Actor_1, Actor_2, Actor_3;
    To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
    Data has 83791 observations with 668 users and 9380 items.
    Data prepared in: 0.197333s
83791 observations to process; with 9380 unique items.

popularity_recommender: Side data

Although item side data is passed in, the warning below indicates that these variables are ignored. Even so, data preparation took more than twice as long.

In [66]:
popularity_side = gl.recommender.popularity_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating",
                                  item_data=item_data) #side-data included
Recsys training: model = popularity
Warning: Ignoring columns movieId, genres, imdb_rating, cert, Actor_0, Actor_1, Actor_2, Actor_3;
    To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
    Data has 83791 observations with 668 users and 10143 items.
    Data prepared in: 0.475438s
83791 observations to process; with 10143 unique items.
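A popularity recommender reduces to scoring every item by its mean training rating and recommending the top-scoring items the user has not yet rated, which is why the item side data above is irrelevant to it. A minimal sketch with toy data (not GraphLab's code):

```python
from collections import defaultdict

def popularity_scores(observations):
    """Mean rating per item from (user, item, rating) triples."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, rating in observations:
        sums[item] += rating
        counts[item] += 1
    return {item: sums[item] / counts[item] for item in sums}

# Hypothetical (userId, title, rating) observations.
obs = [("u1", "Heat", 5.0), ("u2", "Heat", 4.0),
       ("u1", "Casino", 3.0), ("u2", "Fargo", 2.0)]
scores = popularity_scores(obs)

def recommend(user, k=2):
    """Top-k unseen items by mean rating; identical ranking for every user."""
    seen = {i for u, i, _ in obs if u == user}
    ranked = sorted((i for i in scores if i not in seen),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Because every user receives essentially the same ranking, popularity models serve mainly as a baseline, which the precision-recall comparison below bears out.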

Preliminary Model Comparisons

From the precision-recall plot below, the two popularity models score poorly, as do the two factorization-only models.

Of the 7 models tested, 3 are viable. The item-item model performed best, followed by the recommender.create model with side data.
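The cutoff tables below report precision@k and recall@k averaged over users: of a user's top-k recommendations, what fraction appear among their held-out test items, and what fraction of those held-out items were recovered. A minimal sketch of the per-user metric, using the standard definitions (not GraphLab's exact evaluation code):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations vs. a user's held-out test items.
recs = ["Heat", "Fargo", "Casino", "Alien", "Jaws"]
held_out = {"Casino", "Jaws", "Rocky"}
p5, r5 = precision_recall_at_k(recs, held_out, k=5)  # 2 hits out of 5 and out of 3
```

gl.compare averages these per-user values at each cutoff, which is what the mean_precision and mean_recall columns show.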

In [67]:
first_models = [user_item_rec_create_noside,
                user_item_rec_create_side,
                user_item_factor_noside,
                user_item_factor_side,
                item_item_noside,
                popularity_noside,
                popularity_side]

comparisonstruct = gl.compare(test,first_models)

gl.show_comparison(comparisonstruct,first_models)
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0615615615616 | 0.00229283370863 |
|   2    | 0.0653153153153 | 0.00470330650184 |
|   3    | 0.0635635635636 | 0.00684320300261 |
|   4    | 0.0656906906907 | 0.0103954695167  |
|   5    | 0.0621621621622 | 0.0130008248391  |
|   6    | 0.0608108108108 | 0.0150205519956  |
|   7    | 0.0592020592021 |  0.01702330078   |
|   8    | 0.0579954954955 | 0.0189367364077  |
|   9    | 0.0583917250584 | 0.0209762874573  |
|   10   | 0.0567567567568 | 0.0220041826419  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    | 0.132132132132 | 0.0052845915435  |
|   2    | 0.125375375375 | 0.00896729444725 |
|   3    | 0.11961961962  | 0.0136858687787  |
|   4    | 0.112237237237 | 0.0172485755717  |
|   5    | 0.110510510511 | 0.0208701712128  |
|   6    | 0.105855855856 | 0.0240622731683  |
|   7    | 0.102745602746 | 0.0270247934054  |
|   8    | 0.103791291291 | 0.0305715403244  |
|   9    | 0.101434768101 |  0.03415270983   |
|   10   | 0.098048048048 | 0.0365416587434  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M2

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    | 0.0045045045045  | 7.51005652966e-05 |
|   2    | 0.00525525525526 | 0.000159094273963 |
|   3    | 0.00800800800801 | 0.000316629598007 |
|   4    | 0.00863363363363 | 0.000718283122343 |
|   5    | 0.00750750750751 | 0.000742853146913 |
|   6    | 0.00775775775776 |  0.00077642704203 |
|   7    | 0.00729300729301 |  0.00113290596994 |
|   8    | 0.00731981981982 |  0.00125700973363 |
|   9    | 0.00717384050717 |  0.0014366634771  |
|   10   | 0.00660660660661 |  0.00149441353485 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M3

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    | 0.000750750750751 |  0.00010725010725 |
|   3    |   0.001001001001  | 0.000169812669813 |
|   4    | 0.000750750750751 | 0.000169812669813 |
|   5    |  0.0012012012012  | 0.000222210605506 |
|   6    |  0.00125125125125 | 0.000235381671308 |
|   7    |   0.001716001716  | 0.000325809768796 |
|   8    |  0.00206456456456 | 0.000455098068119 |
|   9    |  0.00266933600267 | 0.000800321308281 |
|   10   |  0.00255255255255 | 0.000879347703097 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M4

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.304804804805 | 0.0201890449299 |
|   2    | 0.291291291291 | 0.0372851489488 |
|   3    | 0.272772772773 | 0.0497996247589 |
|   4    | 0.259384384384 | 0.0625066240519 |
|   5    | 0.247147147147 | 0.0727755686591 |
|   6    | 0.241741741742 | 0.0836995508043 |
|   7    | 0.234663234663 |  0.094436512532 |
|   8    | 0.226914414414 |  0.105534332016 |
|   9    | 0.22022022022  |   0.1143472882  |
|   10   | 0.216216216216 |  0.123719482686 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M5

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |  0.0015015015015  | 1.31942135457e-06 |
|   2    | 0.000750750750751 | 1.31942135457e-06 |
|   3    | 0.000500500500501 | 1.31942135457e-06 |
|   4    | 0.000375375375375 | 1.31942135457e-06 |
|   5    |  0.0003003003003  | 1.31942135457e-06 |
|   6    |  0.00025025025025 | 1.31942135457e-06 |
|   7    |  0.0002145002145  | 1.31942135457e-06 |
|   8    | 0.000375375375375 | 7.37386289288e-06 |
|   9    | 0.000333667000334 | 7.37386289288e-06 |
|   10   |  0.0003003003003  | 7.37386289288e-06 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M6

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    |        0.0        |        0.0        |
|   3    |        0.0        |        0.0        |
|   4    |        0.0        |        0.0        |
|   5    |        0.0        |        0.0        |
|   6    |        0.0        |        0.0        |
|   7    |        0.0        |        0.0        |
|   8    | 0.000187687687688 |  2.3460960961e-05 |
|   9    | 0.000166833500167 |  2.3460960961e-05 |
|   10   |  0.0003003003003  | 2.95154024993e-05 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

Model Optimization

In the following section, we will try several parameter combinations for each of the model types above to see whether the models maintain their performance rankings in general.

Five models of each model type are randomly sampled using GraphLab's random grid search method. We chose random search over an exhaustive grid search due to computing constraints.
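Random search simply samples a fixed number of parameter combinations from the grid rather than enumerating all of them. A minimal sketch of the sampling step, using a subset of the grid defined below (not GraphLab's model_parameter_search internals):

```python
import itertools
import random

# Subset of the parameter grid used in the searches below.
grid = {
    "num_factors": [6, 12, 24],
    "regularization": [1e-12, 1e-8, 1e-4, 1],
    "linear_regularization": [1e-12, 1e-8, 1e-4, 1],
}

def random_search(grid, max_models, seed=0):
    """Sample max_models distinct parameter combinations from the grid."""
    keys = sorted(grid)
    combos = list(itertools.product(*(grid[k] for k in keys)))
    random.Random(seed).shuffle(combos)
    return [dict(zip(keys, c)) for c in combos[:max_models]]

candidates = random_search(grid, max_models=5)  # 5 of the 48 combinations
```

Each sampled combination is then trained and scored on the validation split, mirroring what max_models=5 does in the calls below.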

factorization_recommender Model Optimization via a Grid Search process

In [68]:
# Define model parameters
params = {'user_id': 'userId', 
          'item_id': 'title', 
          'target': 'rating',
          'item_data':[item_data,None],
          'num_factors': [6, 12, 24], 
          'regularization':[1e-12,1e-8,1e-4,1],
          'linear_regularization': [1e-12,1e-8,1e-4,1]}

fac_rec_gs = gl.model_parameter_search.random_search.create((train,test),
        gl.recommender.factorization_recommender.create,
        params,
        max_models=5,
        environment=None)
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-15-56-3200000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-15-56-3200000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-18-2016-15-56-3200000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-18-2016-15-56-3200000-17581'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-15-56-3200000-17581' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-15-56-3200000-17581' scheduled.
In [69]:
fac_rec_gs.get_results()
Out[69]:
(Shared columns across all five models: item_id = title, target = rating, user_id = userId.)

model_id 1: item_data = [{'genres': 'Crime|Drama', 'title': ..., linear_regularization = 1e-12, num_factors = 12, regularization = 0.0001
    training:   precision@5 = 0.0149700598802, recall@5 = 0.000967976098407, rmse = 0.482986418926
    validation: precision@5 = 0.00990990990991, recall@5 = 0.00128034949862, rmse = 0.943076880732

model_id 0: item_data = [{'genres': 'Crime|Drama', 'title': ..., linear_regularization = 1.0, num_factors = 24, regularization = 1e-12
    training:   precision@5 = 0.264371257485, recall@5 = 0.0192796228924, rmse = 0.936994344676
    validation: precision@5 = 0.118618618619, recall@5 = 0.0274590946162, rmse = 0.964633662986

model_id 3: item_data = None, linear_regularization = 1e-12, num_factors = 12, regularization = 1.0
    training:   precision@5 = 0.194011976048, recall@5 = 0.0115850446607, rmse = 0.878708846308
    validation: precision@5 = 0.0690690690691, recall@5 = 0.0141406274757, rmse = 0.894355959825

model_id 2: item_data = None, linear_regularization = 1e-08, num_factors = 24, regularization = 1e-08
    training:   precision@5 = 0.0284431137725, recall@5 = 0.00129283355423, rmse = 0.270712526646
    validation: precision@5 = 0.00990990990991, recall@5 = 0.00111836975672, rmse = 1.12784091905

model_id 4: item_data = None, linear_regularization = 0.0001, num_factors = 6, regularization = 1.0
    training:   precision@5 = 0.194011976048, recall@5 = 0.0115850446607, rmse = 0.878867403523
    validation: precision@5 = 0.0693693693694, recall@5 = 0.0141539150996, rmse = 0.894438161547

[5 rows x 14 columns]

ranking_factorization_recommender Model Optimization via a Grid Search process

In [70]:
# Define model parameters
params = {'user_id': 'userId', 
          'item_id': 'title', 
          'target': 'rating',
          'item_data':[item_data,None],
          'num_factors': [6, 12, 24], 
          'regularization':[1e-12,1e-8,1e-4,1],
          'linear_regularization': [1e-12,1e-8,1e-4,1],
          'ranking_regularization':[0, 0.1, 0.5, 1]}

ranking_fac_rec_gs = gl.model_parameter_search.random_search.create((train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=5,
        environment=None)
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-15-59-5300000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-15-59-5300000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-18-2016-15-59-5300000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-18-2016-15-59-5300000-07a25'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-15-59-5300000-07a25' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-15-59-5300000-07a25' scheduled.
In [71]:
ranking_fac_rec_gs.get_results()
Out[71]:
(Shared columns across all five models: item_id = title, target = rating, user_id = userId.)

model_id 1: item_data = [{'genres': 'Crime|Drama', 'title': ..., linear_regularization = 1.0, num_factors = 6, ranking_regularization = 0.5, regularization = 1e-12
    training:   precision@5 = 0.320658682635, recall@5 = 0.0237969973416, rmse = 1.07935542321
    validation: precision@5 = 0.162462462462, recall@5 = 0.0317517169678, rmse = 1.09047994665

model_id 0: item_data = None, linear_regularization = 1.0, num_factors = 6, ranking_regularization = 0.1, regularization = 0.0001
    training:   precision@5 = 0.210479041916, recall@5 = 0.0173362449934, rmse = 0.922887965497
    validation: precision@5 = 0.0741741741742, recall@5 = 0.0178872770859, rmse = 0.944159485506

model_id 3: item_data = [{'genres': 'Crime|Drama', 'title': ..., linear_regularization = 0.0001, num_factors = 12, ranking_regularization = 0.5, regularization = 0.0001
    training:   precision@5 = 0.247904191617, recall@5 = 0.0156825314556, rmse = 0.636872704193
    validation: precision@5 = 0.118618618619, recall@5 = 0.0213295934706, rmse = 0.976678599457

model_id 2: item_data = None, linear_regularization = 1e-12, num_factors = 6, ranking_regularization = 1.0, regularization = 1e-08
    training:   precision@5 = 0.308682634731, recall@5 = 0.0206942246112, rmse = 0.825089056587
    validation: precision@5 = 0.145045045045, recall@5 = 0.0294316953978, rmse = 1.01915518188

model_id 4: item_data = None, linear_regularization = 0.0001, num_factors = 24, ranking_regularization = 0.1, regularization = 1e-08
    training:   precision@5 = 0.296407185629, recall@5 = 0.0227116956155, rmse = 0.709077971315
    validation: precision@5 = 0.12972972973, recall@5 = 0.0281708980736, rmse = 0.882326023915

[5 rows x 15 columns]

item_similarity_recommender Model Optimization via a Grid Search process

In [72]:
# Define model parameters
params = {'user_id': 'userId', 
          'item_id': 'title', 
          'target': 'rating',
          'similarity_type':['jaccard','cosine','pearson'],
          'only_top_k':[5,10,25,64]}#only_top_k: Number of similar items to store for each item. Default value is 64. 
                                    #Decreasing this decreases the amount of memory required for the model, 
                                    #but may also decrease the accuracy.

item_item_gs = gl.model_parameter_search.random_search.create((train,test),
        gl.recommender.item_similarity_recommender.create,
        params,
        max_models=5,
        environment=None)
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-16-03-2000000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-16-03-2000000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-18-2016-16-03-2000000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-18-2016-16-03-2000000-1cf12'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-16-03-2000000-1cf12' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-16-03-2000000-1cf12' scheduled.

Optimized Model Comparison

In [73]:
#all models are placed into the list below
model_List = []

#factorization_recommender
fac_rec = fac_rec_gs.get_models()

#ranking_factorization_recommender
ranking_fac_rec = ranking_fac_rec_gs.get_models()

#item_similarity recommender
item_item_models = item_item_gs.get_models()

#first set of models in list are model_0-4
model_List = [model for model in fac_rec]

#second set of models in list are model_5-9
model_List = model_List + [model for model in ranking_fac_rec]

#third set of models in list are model_10-14
model_List = model_List + [model for model in item_item_models]
In [74]:
comparison_struct = gl.compare(test, model_List)

gl.show_comparison(comparison_struct, model_List)
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.157657657658 | 0.0078630650609 |
|   2    | 0.132132132132 | 0.0127093155349 |
|   3    | 0.121121121121 |  0.015808250389 |
|   4    | 0.118243243243 | 0.0216692738583 |
|   5    | 0.118618618619 | 0.0274590946162 |
|   6    | 0.113863863864 | 0.0299599647576 |
|   7    | 0.109395109395 | 0.0341096571281 |
|   8    | 0.106043543544 | 0.0369451103248 |
|   9    | 0.103603603604 | 0.0391620061994 |
|   10   |  0.1003003003  | 0.0420240595732 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    | 0.00900900900901 | 0.000251428771958 |
|   2    | 0.00900900900901 | 0.000576948992459 |
|   3    | 0.0105105105105  | 0.000908011927434 |
|   4    | 0.0105105105105  |  0.0011595124766  |
|   5    | 0.00990990990991 |  0.00128034949862 |
|   6    | 0.0107607607608  |  0.00179928022143 |
|   7    | 0.00965250965251 |  0.00182334966554 |
|   8    | 0.00938438438438 |  0.0018831044243  |
|   9    | 0.00900900900901 |  0.00197712421947 |
|   10   | 0.00855855855856 |  0.00210136037311 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M2

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    |  0.018018018018  | 0.000386520191355 |
|   2    | 0.0135135135135  | 0.000627476041926 |
|   3    |  0.011011011011  | 0.000745312117398 |
|   4    | 0.00975975975976 | 0.000918476819294 |
|   5    | 0.00990990990991 |  0.00111836975672 |
|   6    |  0.01001001001   |  0.00135994029036 |
|   7    | 0.0100815100815  |  0.00166952653313 |
|   8    | 0.0108858858859  |  0.00199997374415 |
|   9    | 0.0111778445112  |  0.00257863874344 |
|   10   | 0.0108108108108  |  0.00263729953426 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M3

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0945945945946 | 0.00319641894159 |
|   2    | 0.0885885885886 | 0.00623695808988 |
|   3    | 0.0785785785786 | 0.00809331110205 |
|   4    | 0.0750750750751 |  0.011364843754  |
|   5    | 0.0690690690691 | 0.0141406274757  |
|   6    | 0.0650650650651 | 0.0162994023502  |
|   7    | 0.0615615615616 | 0.0176346145981  |
|   8    | 0.0587462462462 | 0.0190125032587  |
|   9    | 0.0582248915582 | 0.0204006485935  |
|   10   | 0.0566066066066 | 0.0222639543244  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M4

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0945945945946 | 0.00319641894159 |
|   2    | 0.0893393393393 | 0.00624622661767 |
|   3    | 0.0790790790791 | 0.0081016993786  |
|   4    | 0.0758258258258 | 0.0113841858195  |
|   5    | 0.0693693693694 | 0.0141539150996  |
|   6    | 0.0655655655656 | 0.0163280049731  |
|   7    | 0.0622050622051 | 0.0176711794049  |
|   8    | 0.0589339339339 | 0.0190003359397  |
|   9    | 0.0588922255589 | 0.0205180735336  |
|   10   | 0.0578078078078 | 0.0225834157225  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M5

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0915915915916 | 0.00439586201482 |
|   2    | 0.0885885885886 | 0.00859852817495 |
|   3    | 0.0835835835836 | 0.0125214060237  |
|   4    | 0.0773273273273 | 0.0153017979303  |
|   5    | 0.0741741741742 | 0.0178872770859  |
|   6    | 0.0725725725726 | 0.0212429536079  |
|   7    | 0.0690690690691 | 0.0229698544229  |
|   8    | 0.0671921921922 | 0.0253657090606  |
|   9    | 0.0672339005672 | 0.0281185753942  |
|   10   | 0.0663663663664 | 0.0317437872311  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M6

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.205705705706 | 0.0087730426516 |
|   2    | 0.195195195195 | 0.0154530122891 |
|   3    | 0.179179179179 | 0.0213391052143 |
|   4    | 0.171546546547 |  0.026856336894 |
|   5    | 0.162462462462 | 0.0317517169678 |
|   6    | 0.154654654655 | 0.0364588807619 |
|   7    | 0.151437151437 | 0.0423365573191 |
|   8    | 0.15015015015  | 0.0504819081976 |
|   9    | 0.147981314648 | 0.0556661926144 |
|   10   | 0.143693693694 | 0.0593070169993 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M7

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    | 0.201201201201 | 0.00811465307598 |
|   2    | 0.18018018018  | 0.0152172278445  |
|   3    | 0.16016016016  | 0.0199132274716  |
|   4    | 0.152402402402 | 0.0252074597685  |
|   5    | 0.145045045045 | 0.0294316953978  |
|   6    | 0.138888888889 | 0.0330129989429  |
|   7    | 0.133848133848 |  0.035914158606  |
|   8    | 0.125938438438 | 0.0382550882574  |
|   9    | 0.125792459126 | 0.0421826912701  |
|   10   | 0.123273273273 | 0.0456581922298  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M8

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    | 0.127627627628 | 0.00357523718393 |
|   2    | 0.135135135135 | 0.00884532201438 |
|   3    | 0.132632632633 |  0.015006399823  |
|   4    | 0.123873873874 | 0.0181365439476  |
|   5    | 0.118618618619 | 0.0213295934706  |
|   6    | 0.117117117117 | 0.0262547490218  |
|   7    | 0.113685113685 | 0.0292829941978  |
|   8    | 0.109046546547 | 0.0311599698117  |
|   9    | 0.105605605606 | 0.0331326133995  |
|   10   | 0.101651651652 | 0.0361695895585  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M9

Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    | 0.166666666667 | 0.00744216616515 |
|   2    | 0.153903903904 | 0.0132199297613  |
|   3    | 0.143643643644 | 0.0192614577984  |
|   4    | 0.134384384384 | 0.0235996743323  |
|   5    | 0.12972972973  | 0.0281708980736  |
|   6    | 0.122872872873 | 0.0311711678338  |
|   7    | 0.119476619477 | 0.0350360905683  |
|   8    | 0.114301801802 | 0.0383291218832  |
|   9    | 0.11044377711  |  0.041351585797  |
|   10   | 0.107507507508 | 0.0446965152258  |
+--------+----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M10

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.339339339339 | 0.0230477513501 |
|   2    | 0.309309309309 |  0.038741254887 |
|   3    | 0.295795795796 |  0.05392083358  |
|   4    | 0.278153153153 | 0.0672936272046 |
|   5    | 0.268168168168 | 0.0786382444111 |
|   6    | 0.25950950951  | 0.0912010893658 |
|   7    | 0.254826254826 |  0.103540171202 |
|   8    | 0.246996996997 |  0.114014691828 |
|   9    | 0.24024024024  |  0.124027161344 |
|   10   | 0.232432432432 |  0.133058244132 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M11

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.292792792793 | 0.0203029488856 |
|   2    | 0.273273273273 | 0.0354953222789 |
|   3    | 0.261261261261 |  0.048005259029 |
|   4    | 0.256756756757 | 0.0617723158023 |
|   5    | 0.248348348348 | 0.0734077673138 |
|   6    | 0.237237237237 | 0.0828833697521 |
|   7    | 0.231016731017 | 0.0921685022839 |
|   8    | 0.225225225225 |  0.102013608452 |
|   9    | 0.22038705372  |  0.111152250073 |
|   10   | 0.215915915916 |  0.120387018809 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M12

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.333333333333 | 0.0221296542743 |
|   2    | 0.298048048048 |  0.039372049276 |
|   3    | 0.274274274274 | 0.0517849283232 |
|   4    | 0.261636636637 | 0.0649352497504 |
|   5    | 0.253453453453 |  0.076942355395 |
|   6    | 0.247247247247 | 0.0893544492244 |
|   7    | 0.23830973831  |  0.10199223712  |
|   8    | 0.232357357357 |  0.112325888004 |
|   9    | 0.227394060727 |  0.121468255771 |
|   10   | 0.21966966967  |  0.128801990461 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M13

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    | 0.000750750750751 | 1.02142959286e-05 |
|   3    |   0.001001001001  | 7.27768584911e-05 |
|   4    |  0.00112612612613 | 7.40962798457e-05 |
|   5    | 0.000900900900901 | 7.40962798457e-05 |
|   6    |  0.00125125125125 | 0.000149829908211 |
|   7    |  0.0010725010725  | 0.000149829908211 |
|   8    | 0.000938438438438 | 0.000149829908211 |
|   9    | 0.000834167500834 | 0.000149829908211 |
|   10   | 0.000750750750751 | 0.000149829908211 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M14

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.34984984985  | 0.0241785148085 |
|   2    | 0.303303303303 | 0.0403291907005 |
|   3    | 0.280780780781 | 0.0533188349805 |
|   4    | 0.262762762763 | 0.0646750460207 |
|   5    | 0.254054054054 | 0.0785363402525 |
|   6    | 0.243493493493 | 0.0891918707425 |
|   7    | 0.234877734878 |  0.099245257038 |
|   8    | 0.225412912913 |  0.107603475899 |
|   9    | 0.219386052719 |  0.116475412269 |
|   10   | 0.211111111111 |  0.121931750006 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

As expected, the item-item recommender model outperforms all other models consistently.

  • first set of models in list are model_0-4: factorization_recommender
  • second set of models in list are model_5-9: ranking_factorization_recommender
  • third set of models in list are model_10-14: item_similarity
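As a reminder of how the precision-recall tables above are produced, precision@k and recall@k for a single user can be sketched as follows. This is a minimal stdlib illustration with our own function name, not GraphLab's implementation; GraphLab additionally averages these values over all test users at each cutoff.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: ranked list of item ids, best first
    relevant: set of items the user actually interacted with in the test set
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Toy example: 2 of the top-5 recommendations appear in a 10-item test set
p, r = precision_recall_at_k(
    ["a", "b", "c", "d", "e"],
    {"b", "d", "x", "y", "z", "q", "w", "t", "u", "v"},
    k=5)
# p = 2/5 = 0.4, r = 2/10 = 0.2
```

Raising the cutoff k tends to lower precision and raise recall, which is exactly the pattern visible in the cutoff tables above.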

The output below shows similar-item results from an item_similarity model. The results are intuitively reasonable.

In [100]:
results = item_item_noside.get_similar_items(k=5)

results
Out[100]:
title similar score rank
Casino (1995) Goodfellas (1990) 0.440444588661 1
Casino (1995) Reservoir Dogs (1992) 0.395015716553 2
Casino (1995) Pulp Fiction (1994) 0.37524074316 3
Casino (1995) Ed Wood (1994) 0.357009768486 4
Casino (1995) True Romance (1993) 0.355275392532 5
Powder (1995) Dr. Dolittle (1998) 0.33680254221 1
Powder (1995) Program, The (1993) 0.32738506794 2
Powder (1995) Entrapment (1999) 0.318438768387 3
Powder (1995) Arachnophobia (1990) 0.317321956158 4
Powder (1995) Above the Rim (1994) 0.312942564487 5
[46900 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [97]:
results[results['title'] == 'Pulp Fiction (1994)']
Out[97]:
title similar score rank
Pulp Fiction (1994) Silence of the Lambs, The (1991) 0.583420097828 1
Pulp Fiction (1994) Shawshank Redemption, The (1994) 0.582681536674 2
Pulp Fiction (1994) Seven (a.k.a. Se7en) (1995) 0.529344499111 3
Pulp Fiction (1994) Fugitive, The (1993) 0.521369814873 4
Pulp Fiction (1994) Batman (1989) 0.507000148296 5
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
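An item_similarity model with jaccard similarity scores two movies by the overlap of the user sets that rated them. The following is a minimal sketch of that scoring with a made-up toy example; the helper name is our own, not part of the GraphLab API.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity between the sets of users who rated two items."""
    inter = len(users_a & users_b)
    union = len(users_a | users_b)
    return inter / float(union) if union else 0.0

# Toy example: 3 users rated both movies, 5 distinct users rated either
raters_movie_a = {1, 2, 3, 4}
raters_movie_b = {2, 3, 4, 5}
score = jaccard(raters_movie_a, raters_movie_b)  # 3/5 = 0.6
```

Under this measure, two movies are similar when largely the same audience rated both, which explains pairings like Casino/Goodfellas in the output above.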

Model 2: Actor Recommender for Casting Agents

Testing Recommendation Engines for Actors given a Genre

In [75]:
#Original Dataframe
print (df.shape)
df.head()
(104759, 11)
Out[75]:
userId movieId rating title genres imdb_rating cert Actor_0 Actor_1 Actor_2 Actor_3
0 1 16 4.0 Casino (1995) Crime|Drama 8.2 R Robert De Niro Sharon Stone Joe Pesci James Woods
1 1 24 1.5 Powder (1995) Drama|Sci-Fi 6.5 PG-13 Mary Steenburgen Sean Patrick Flanery Lance Henriksen Jeff Goldblum
2 1 32 4.0 Twelve Monkeys (a.k.a. 12 Monkeys) (1995) Mystery|Sci-Fi|Thriller 8.1 R Bruce Willis Madeleine Stowe Brad Pitt Joseph Melito
3 1 47 4.0 Seven (a.k.a. Se7en) (1995) Mystery|Thriller 8.6 R Morgan Freeman Brad Pitt Kevin Spacey Andrew Kevin Walker
4 1 50 4.0 Usual Suspects, The (1995) Crime|Mystery|Thriller 8.6 R Kevin Spacey Gabriel Byrne Chazz Palminteri Stephen Baldwin

Defining Actor_Genre dataframe

First, we wish to include all actors in a single column, one row per (movie, actor) pair

In [133]:
data_1=data['genres', 'movieId', 'rating', 'Actor_0']
data_1.rename({'Actor_0': 'Actor'})

data_2=data['genres', 'movieId', 'rating', 'Actor_1']
data_2.rename({'Actor_1': 'Actor'})

data_3=data['genres', 'movieId', 'rating', 'Actor_2']
data_3.rename({'Actor_2': 'Actor'})

data_4=data['genres', 'movieId', 'rating', 'Actor_3']
data_4.rename({'Actor_3': 'Actor'})

actor_genres = data_1.append(data_2).append(data_3).append(data_4)

print actor_genres.shape
actor_genres.head()
(419036, 4)
Out[133]:
genres movieId rating Actor
Crime|Drama 16 4.0 Robert De Niro
Drama|Sci-Fi 24 1.5 Mary Steenburgen
Mystery|Sci-Fi|Thriller 32 4.0 Bruce Willis
Mystery|Thriller 47 4.0 Morgan Freeman
Crime|Mystery|Thriller 50 4.0 Kevin Spacey
Action|Drama|War 110 4.0 Mel Gibson
Adventure|Drama|IMAX 150 3.0 Tom Hanks
Drama|Thriller|War 161 4.0 Gene Hackman
Action|Crime|Thriller 165 3.0 Bruce Willis
Action 204 0.5 Steven Seagal
[10 rows x 4 columns]
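The four select/rename/append calls above amount to unpivoting the actor columns into rows. The same reshaping can be sketched in plain Python, with dict rows standing in for the SFrame (note: this version emits the four rows for each movie together, whereas the column-wise appends in the notebook group all Actor_0 rows first).

```python
def stack_actor_columns(rows, actor_cols=("Actor_0", "Actor_1", "Actor_2", "Actor_3")):
    """Turn one row with four actor columns into four rows with a single 'Actor' column."""
    stacked = []
    for row in rows:
        for col in actor_cols:
            stacked.append({"genres": row["genres"],
                            "movieId": row["movieId"],
                            "rating": row["rating"],
                            "Actor": row[col]})
    return stacked

movie = [{"genres": "Crime|Drama", "movieId": 16, "rating": 4.0,
          "Actor_0": "Robert De Niro", "Actor_1": "Sharon Stone",
          "Actor_2": "Joe Pesci", "Actor_3": "James Woods"}]
stacked = stack_actor_columns(movie)  # 4 rows, one per actor
```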

Defining Actor Genre dataframe with Genres broken out

The code below goes a step further: it splits the pipe-delimited genres string and associates each individual genre with the actors

In [134]:
df_genre = actor_genres.to_dataframe()

s = df_genre['genres'].str.split('|').apply(pd.Series, 1).stack()

s.index = s.index.droplevel(-1) # to line up with df's index

s.name = 'GenreSplit'

actor_genre_split = gl.SFrame(data=pd.concat([df_genre, s.to_frame()], axis=1, join='inner'))
actor_genre_split.head()
Out[134]:
genres movieId rating Actor GenreSplit
Crime|Drama 16 4.0 Robert De Niro Crime
Crime|Drama 16 4.0 Robert De Niro Drama
Drama|Sci-Fi 24 1.5 Mary Steenburgen Drama
Drama|Sci-Fi 24 1.5 Mary Steenburgen Sci-Fi
Mystery|Sci-Fi|Thriller 32 4.0 Bruce Willis Mystery
Mystery|Sci-Fi|Thriller 32 4.0 Bruce Willis Sci-Fi
Mystery|Sci-Fi|Thriller 32 4.0 Bruce Willis Thriller
Mystery|Thriller 47 4.0 Morgan Freeman Mystery
Mystery|Thriller 47 4.0 Morgan Freeman Thriller
Crime|Mystery|Thriller 50 4.0 Kevin Spacey Crime
[10 rows x 5 columns]
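The split/stack idiom above explodes the pipe-delimited genres field into one row per genre. In plain Python the same transform looks like this (the helper name is our own):

```python
def explode_genres(rows):
    """Emit one output row per genre in the pipe-delimited 'genres' field."""
    out = []
    for row in rows:
        for genre in row["genres"].split("|"):
            new_row = dict(row)       # copy the row
            new_row["GenreSplit"] = genre
            out.append(new_row)
    return out

row = {"genres": "Mystery|Sci-Fi|Thriller", "movieId": 32,
       "rating": 4.0, "Actor": "Bruce Willis"}
exploded = explode_genres([row])  # 3 rows: Mystery, Sci-Fi, Thriller
```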

Evaluating recommenders for Actor recommendations given a movie's genre (genres not split)

In [144]:
# Split the data into a single training and test set
train, test = gl.recommender.util.random_split_by_user(actor_genres,
                                                       user_id="genres", 
                                                       item_id="Actor",
                                                       max_num_users=None, #None: use all available users for test set
                                                       item_test_proportion=0.2) #80/20 train/test split
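random_split_by_user holds out a fraction of each user's items rather than splitting rows at random, so every test user also appears in training. A rough stdlib sketch of that idea (our simplification, not GraphLab's exact sampling):

```python
import random

def split_by_user(rows, user_key="genres", test_proportion=0.2, seed=0):
    """Hold out roughly test_proportion of each user's rows for the test set."""
    rng = random.Random(seed)
    by_user = {}
    for row in rows:
        by_user.setdefault(row[user_key], []).append(row)
    train, test = [], []
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        n_test = int(len(user_rows) * test_proportion)
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

# One "user" (genre string) with 10 rated actors -> 8 train rows, 2 test rows
rows = [{"genres": "Crime", "Actor": "A%d" % i} for i in range(10)]
train, test = split_by_user(rows)
```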

First, try modeling with an item_similarity_recommender

In [145]:
# Define model parameters
params = {'user_id': 'genres', 
          'item_id': 'Actor', 
          'target': 'rating',
          'similarity_type':['jaccard','cosine','pearson'],
          'only_top_k':[5,10,25,64]}

item_item_actor_gs = gl.model_parameter_search.random_search.create((train,test),
        gl.recommender.item_similarity_recommender.create,
        params,
        max_models=5,
        environment=None)
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-31-4900000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-31-4900000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-18-2016-19-31-4900000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-18-2016-19-31-4900000-87299'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-31-4900000-87299' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-31-4900000-87299' scheduled.
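model_parameter_search.random_search evaluates max_models randomly drawn combinations from the parameter grid instead of exhausting it. The sampling step can be sketched as follows (illustrative only; the function name is our own, and GraphLab additionally trains and scores a model per combination):

```python
import random

def sample_param_combos(params, max_models, seed=0):
    """Draw max_models random combinations from a dict of parameter lists.

    Scalar values (e.g. fixed user_id/item_id/target) are kept as-is;
    list values are sampled uniformly.
    """
    rng = random.Random(seed)
    combos = []
    for _ in range(max_models):
        combo = {}
        for key, value in params.items():
            combo[key] = rng.choice(value) if isinstance(value, list) else value
        combos.append(combo)
    return combos

params = {"user_id": "genres", "item_id": "Actor", "target": "rating",
          "similarity_type": ["jaccard", "cosine", "pearson"],
          "only_top_k": [5, 10, 25, 64]}
combos = sample_param_combos(params, max_models=5)  # 5 candidate configurations
```

This explains why the grid-search results tables below contain five model rows with repeated hyperparameter values: random sampling with replacement can draw the same combination more than once.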
In [146]:
item_item_actor_gs.get_results()
Out[146]:
model_id item_id only_top_k similarity_type target user_id training_precision@5 training_recall@5 training_rmse validation_precision@5 validation_recall@5 validation_rmse
1 Actor 10 jaccard rating genres 0.526860841424 0.11533038307 None 0.00222772277228 0.00144586291955 3.63284551797
0 Actor 25 pearson rating genres 0.0017259978425 0.000301460794825 None 0.0 0.0 1.12278091985
3 Actor 5 pearson rating genres 0.0017259978425 0.000584129749765 None 0.0 0.0 1.12407176409
2 Actor 5 cosine rating genres 0.538727076591 0.117663136805 None 0.00173267326733 0.00140425082536 3.58198932523
4 Actor 10 jaccard rating genres 0.527292340885 0.114250324851 None 0.00173267326733 0.00140068092683 3.63282762966
[5 rows x 12 columns]

Second, try a ranking_factorization_recommender

In [147]:
# Define model parameters
params = {'user_id': 'genres', 
          'item_id': 'Actor', 
          'target': 'rating',
          'num_factors': [6, 12, 24], 
          'regularization':[1e-12,1e-8,1e-4,1],
          'linear_regularization': [1e-12,1e-8,1e-4,1],
          'ranking_regularization':[0, 0.1, 0.5, 1]}

ranking_fac_rec_actor_gs = gl.model_parameter_search.random_search.create((train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=5,
        environment=None)
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-33-0700000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-33-0700000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-18-2016-19-33-0700000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-18-2016-19-33-0700000-c919a'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-33-0700000-c919a' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-33-0700000-c919a' scheduled.
In [148]:
ranking_fac_rec_actor_gs.get_results()
Out[148]:
model_id item_id linear_regularization num_factors ranking_regularization regularization target user_id training_precision@5 training_recall@5 training_rmse validation_precision@5 validation_recall@5 validation_rmse
1 Actor 0.0001 24 0.0 0.0001 rating genres 0.0390507011866 0.000188822366388 0.928080449273 0.000247524752475 2.13016138103e-06 0.964416921923
0 Actor 1.0 6 1.0 1.0 rating genres 0.0181229773463 0.000482036027387 1.04408806184 0.000742574257426 0.000129118308918 1.04604399114
3 Actor 0.0001 24 0.0 0.0001 rating genres 0.0606256742179 0.000291590000249 0.920097312571 0.00049504950495 1.73558624037e-06 0.961752205426
2 Actor 1e-08 12 0.1 1e-08 rating genres 0.0746494066882 0.00125372731069 0.91241057419 0.00049504950495 1.39096968907e-05 1.01208349546
4 Actor 1e-12 6 0.0 1.0 rating genres 0.0155339805825 0.000259699922518 1.03466828602 0.00049504950495 0.000128919141914 1.03698984609
[5 rows x 14 columns]

Model comparisons

In [149]:
item_item_actor_genre = item_item_actor_gs.get_models()
ranking_fac_rec_actor_genre = ranking_fac_rec_actor_gs.get_models()

actor_genre_model_List = []

#first set of models in list are model_0-4
actor_genre_model_List = [model for model in item_item_actor_genre]

#second set of models in list are model_5-9
actor_genre_model_List = actor_genre_model_List + [model for model in ranking_fac_rec_actor_genre]

comparison_Actor = gl.compare(test, actor_genre_model_List)

gl.show_comparison(comparison_Actor, actor_genre_model_List)
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-------------+
| cutoff | mean_precision | mean_recall |
+--------+----------------+-------------+
|   1    |      0.0       |     0.0     |
|   2    |      0.0       |     0.0     |
|   3    |      0.0       |     0.0     |
|   4    |      0.0       |     0.0     |
|   5    |      0.0       |     0.0     |
|   6    |      0.0       |     0.0     |
|   7    |      0.0       |     0.0     |
|   8    |      0.0       |     0.0     |
|   9    |      0.0       |     0.0     |
|   10   |      0.0       |     0.0     |
+--------+----------------+-------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    | 0.00247524752475 | 0.000137741968923 |
|   2    | 0.00185643564356 | 0.000139000996148 |
|   3    | 0.00247524752475 |  0.00137732953509 |
|   4    | 0.00216584158416 |  0.00137983484635 |
|   5    | 0.00222772277228 |  0.00144586291955 |
|   6    | 0.00247524752475 |  0.00144964598774 |
|   7    | 0.0021216407355  |  0.00144964598774 |
|   8    | 0.00185643564356 |  0.00144964598774 |
|   9    | 0.0016501650165  |  0.00144964598774 |
|   10   | 0.00160891089109 |  0.00145010082925 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M2

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |  0.00123762376238 | 0.000137513751375 |
|   2    |  0.00185643564356 | 0.000161064310122 |
|   3    |  0.0016501650165  | 0.000161263477126 |
|   4    |  0.00154702970297 | 0.000161491694674 |
|   5    |  0.00173267326733 |  0.00140425082536 |
|   6    |  0.00144389438944 |  0.00140425082536 |
|   7    |  0.00123762376238 |  0.00140425082536 |
|   8    |  0.00123762376238 |  0.00140675613662 |
|   9    |   0.001100110011  |  0.00140675613662 |
|   10   | 0.000990099009901 |  0.00140675613662 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M3

Precision and recall summary statistics by cutoff
+--------+----------------+-------------+
| cutoff | mean_precision | mean_recall |
+--------+----------------+-------------+
|   1    |      0.0       |     0.0     |
|   2    |      0.0       |     0.0     |
|   3    |      0.0       |     0.0     |
|   4    |      0.0       |     0.0     |
|   5    |      0.0       |     0.0     |
|   6    |      0.0       |     0.0     |
|   7    |      0.0       |     0.0     |
|   8    |      0.0       |     0.0     |
|   9    |      0.0       |     0.0     |
|   10   |      0.0       |     0.0     |
+--------+----------------+-------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M4

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    | 0.00247524752475 | 0.000137741968923 |
|   2    | 0.00185643564356 | 0.000139000996148 |
|   3    | 0.00206270627063 |  0.00137710131754 |
|   4    | 0.00185643564356 |  0.00137732953509 |
|   5    | 0.00173267326733 |  0.00140068092683 |
|   6    | 0.0016501650165  |  0.00140113576835 |
|   7    | 0.00159123055163 |  0.00144381244981 |
|   8    | 0.00170173267327 |  0.00144916942872 |
|   9    | 0.00151265126513 |  0.00144916942872 |
|   10   | 0.00136138613861 |  0.00144916942872 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M5

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    |        0.0        |        0.0        |
|   3    | 0.000412541254125 | 1.99167003923e-07 |
|   4    | 0.000309405940594 | 1.99167003923e-07 |
|   5    | 0.000742574257426 | 0.000129118308918 |
|   6    | 0.000618811881188 | 0.000129118308918 |
|   7    | 0.000530410183876 | 0.000129118308918 |
|   8    | 0.000618811881188 | 0.000148158674493 |
|   9    | 0.000550055005501 | 0.000148158674493 |
|   10   | 0.000618811881188 | 0.000152470952411 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M6

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    |        0.0        |        0.0        |
|   3    | 0.000412541254125 | 2.13016138103e-06 |
|   4    | 0.000309405940594 | 2.13016138103e-06 |
|   5    | 0.000247524752475 | 2.13016138103e-06 |
|   6    | 0.000206270627063 | 2.13016138103e-06 |
|   7    |  0.00035360678925 | 1.50220755724e-05 |
|   8    | 0.000309405940594 | 1.50220755724e-05 |
|   9    |  0.00027502750275 | 1.50220755724e-05 |
|   10   | 0.000247524752475 | 1.50220755724e-05 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M7

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |  0.00123762376238 | 1.01778269932e-06 |
|   2    |  0.00123762376238 | 1.39096968907e-05 |
|   3    | 0.000825082508251 | 1.39096968907e-05 |
|   4    | 0.000618811881188 | 1.39096968907e-05 |
|   5    |  0.00049504950495 | 1.39096968907e-05 |
|   6    | 0.000618811881188 | 1.41088638947e-05 |
|   7    | 0.000884016973126 | 6.62076896571e-05 |
|   8    | 0.000928217821782 | 9.19915180399e-05 |
|   9    |   0.001100110011  | 0.000119379307225 |
|   10   | 0.000990099009901 | 0.000119379307225 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M8

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |  0.00123762376238 | 4.76559015162e-07 |
|   2    | 0.000618811881188 | 4.76559015162e-07 |
|   3    | 0.000412541254125 | 4.76559015162e-07 |
|   4    | 0.000618811881188 | 1.73558624037e-06 |
|   5    |  0.00049504950495 | 1.73558624037e-06 |
|   6    | 0.000618811881188 | 2.99461346557e-06 |
|   7    | 0.000707213578501 |  3.3935207525e-05 |
|   8    | 0.000618811881188 |  3.3935207525e-05 |
|   9    | 0.000550055005501 |  3.3935207525e-05 |
|   10   |  0.00049504950495 |  3.3935207525e-05 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M9

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    |        0.0        |        0.0        |
|   3    |        0.0        |        0.0        |
|   4    | 0.000618811881188 | 0.000128919141914 |
|   5    |  0.00049504950495 | 0.000128919141914 |
|   6    | 0.000412541254125 | 0.000128919141914 |
|   7    |  0.00035360678925 | 0.000128919141914 |
|   8    | 0.000464108910891 | 0.000129118308918 |
|   9    | 0.000412541254125 | 0.000129118308918 |
|   10   | 0.000371287128713 | 0.000129118308918 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

From the precision-recall plot above, we can see that model_1, an item_similarity recommender, performs better than the other models. The models nearest to model_1 are also item_similarity models.

  • first set of models in list are model_0-4: item_similarity
  • second set of models in list are model_5-9: ranking_factorization_recommender

In [152]:
actor_genre_model_List[1]
Out[152]:
Class                            : ItemSimilarityRecommender

Schema
------
User ID                          : genres
Item ID                          : Actor
Target                           : rating
Additional observation features  : 0
User side features               : []
Item side features               : []

Statistics
----------
Number of observations           : 335250
Number of users                  : 927
Number of items                  : 14372

Training summary
----------------
Training time                    : 5.2915

Model Parameters
----------------
Model class                      : ItemSimilarityRecommender
threshold                        : 0.001
similarity_type                  : jaccard
training_method                  : auto

Other Settings
--------------
degree_approximation_threshold   : 4096
sparse_density_estimation_sample_size : 4096
max_data_passes                  : 4096
target_memory_usage              : 8589934592
seed_item_set_size               : 50
nearest_neighbors_interaction_proportion_threshold : 0.05
max_item_neighborhood_size       : 10
In [153]:
actor_genre_rec = gl.recommender.item_similarity_recommender.create(actor_genres, 
                                  user_id="genres", 
                                  item_id="Actor", 
                                  target="rating",
                                  similarity_type='jaccard')

results = actor_genre_rec.get_similar_items(k=2)
results
Recsys training: model = item_similarity
Warning: Ignoring columns movieId;
    To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
    Data has 419036 observations with 931 users and 15458 items.
    Data prepared in: 1.23101s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 19.263ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 1.25s                               | 1.25             | 220             |
| 9.39s                               | 100              | 15458           |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 9.53765s
Out[153]:
Actor similar score rank
Robert De Niro Sean Penn 0.342105269432 1
Robert De Niro Gene Hackman 0.333333313465 2
Mary Steenburgen Imelda Staunton 0.416666686535 1
Mary Steenburgen Frank Whaley 0.384615361691 2
Bruce Willis Jason Statham 0.254901945591 1
Bruce Willis Willem Dafoe 0.254545450211 2
Morgan Freeman Gene Hackman 0.355555534363 1
Morgan Freeman Tommy Lee Jones 0.311111092567 2
Kevin Spacey Andy Garcia 0.3125 1
Kevin Spacey Jeff Bridges 0.272727251053 2
[30916 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Although the precision-recall scores are not high, some of the results here make sense: for example, Bruce Willis's nearest neighbor is Jason Statham. This model does warrant further improvement, however.

Recommender where genres are broken out individually

In [154]:
actor_genre_split.head()
Out[154]:
genres movieId rating Actor GenreSplit
Crime|Drama 16 4.0 Robert De Niro Crime
Crime|Drama 16 4.0 Robert De Niro Drama
Drama|Sci-Fi 24 1.5 Mary Steenburgen Drama
Drama|Sci-Fi 24 1.5 Mary Steenburgen Sci-Fi
Mystery|Sci-Fi|Thriller 32 4.0 Bruce Willis Mystery
Mystery|Sci-Fi|Thriller 32 4.0 Bruce Willis Sci-Fi
Mystery|Sci-Fi|Thriller 32 4.0 Bruce Willis Thriller
Mystery|Thriller 47 4.0 Morgan Freeman Mystery
Mystery|Thriller 47 4.0 Morgan Freeman Thriller
Crime|Mystery|Thriller 50 4.0 Kevin Spacey Crime
[10 rows x 5 columns]
In [157]:
# Split the data into a single training and test set
train, test = gl.recommender.util.random_split_by_user(actor_genre_split,
                                                       user_id="GenreSplit", 
                                                       item_id="Actor",
                                                       max_num_users=None, #None: use all available users for test set
                                                       item_test_proportion=0.2) #80/20 train/test split
In [158]:
# Define model parameters
params = {'user_id': 'GenreSplit', 
          'item_id': 'Actor', 
          'target': 'rating',
          'similarity_type':['jaccard','cosine','pearson'],
          'only_top_k':[5,10,25,64]}

item_item_actor_genre_split_gs = gl.model_parameter_search.random_search.create((train,test),
        gl.recommender.item_similarity_recommender.create,
        params,
        max_models=5,
        environment=None)
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-42-2500000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-42-2500000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-18-2016-19-42-2500000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-18-2016-19-42-2500000-b208f'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-42-2500000-b208f' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-42-2500000-b208f' scheduled.
In [159]:
# Define model parameters
params = {'user_id': 'GenreSplit', 
          'item_id': 'Actor', 
          'target': 'rating',
          'num_factors': [6, 12, 24], 
          'regularization':[1e-12,1e-8,1e-4,1],
          'linear_regularization': [1e-12,1e-8,1e-4,1],
          'ranking_regularization':[0, 0.1, 0.5, 1]}

ranking_fac_rec_actor_genre_split_gs = gl.model_parameter_search.random_search.create((train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=5,
        environment=None)
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-42-5300000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-42-5300000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-18-2016-19-42-5300000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-18-2016-19-42-5300000-0c430'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-18-2016-19-42-5300000-0c430' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-18-2016-19-42-5300000-0c430' scheduled.
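Random parameter search, as used in the two cells above, samples a fixed number of combinations (`max_models`) from the parameter grid rather than fitting a model for every combination. A minimal sketch of the idea, using a hypothetical `random_search` helper rather than GraphLab's implementation:

```python
import itertools
import random

def random_search(param_grid, max_models, seed=0):
    """Sample up to max_models parameter combinations, without
    replacement, from a dict of parameter-name -> list-of-values."""
    keys = sorted(param_grid)
    combos = [dict(zip(keys, vals))
              for vals in itertools.product(*(param_grid[k] for k in keys))]
    random.Random(seed).shuffle(combos)
    return combos[:max_models]

grid = {'similarity_type': ['jaccard', 'cosine', 'pearson'],
        'only_top_k': [5, 10, 25, 64]}
picks = random_search(grid, max_models=5)
# 5 of the 12 possible combinations, each then used to fit one model
```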
In [160]:
item_item_actor_genre_split = item_item_actor_genre_split_gs.get_models()
ranking_fac_rec_actor_genre_split = ranking_fac_rec_actor_genre_split_gs.get_models()

# Combine the models from both searches into one list:
# models 0-4 are item_similarity, models 5-9 are ranking_factorization
actor_genre_split_model_List = (list(item_item_actor_genre_split) +
                                list(ranking_fac_rec_actor_genre_split))

comparison_Actor = gl.compare(test, actor_genre_split_model_List)

gl.show_comparison(comparison_Actor, actor_genre_split_model_List)
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+-------------------+
| cutoff |  mean_precision |    mean_recall    |
+--------+-----------------+-------------------+
|   1    |       0.1       |  1.2374455746e-05 |
|   2    |       0.05      |  1.2374455746e-05 |
|   3    | 0.0333333333333 |  1.2374455746e-05 |
|   4    |      0.0375     |  1.5642002472e-05 |
|   5    |       0.03      |  1.5642002472e-05 |
|   6    |      0.025      |  1.5642002472e-05 |
|   7    | 0.0285714285714 | 1.69749803445e-05 |
|   8    |      0.025      | 1.69749803445e-05 |
|   9    | 0.0222222222222 | 1.69749803445e-05 |
|   10   |       0.03      | 1.99643003511e-05 |
+--------+-----------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    |       0.0        |        0.0        |
|   2    |       0.0        |        0.0        |
|   3    |       0.0        |        0.0        |
|   4    |       0.0        |        0.0        |
|   5    |       0.01       | 7.52219046186e-06 |
|   6    | 0.00833333333333 | 7.52219046186e-06 |
|   7    | 0.00714285714286 | 7.52219046186e-06 |
|   8    |     0.00625      | 7.52219046186e-06 |
|   9    | 0.0111111111111  | 9.17853259589e-06 |
|   10   |       0.01       | 9.17853259589e-06 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M2

Precision and recall summary statistics by cutoff
+--------+-----------------+-------------------+
| cutoff |  mean_precision |    mean_recall    |
+--------+-----------------+-------------------+
|   1    |       0.0       |        0.0        |
|   2    |      0.025      | 3.26754672592e-06 |
|   3    | 0.0166666666667 | 3.26754672592e-06 |
|   4    |      0.025      | 4.92388885995e-06 |
|   5    |       0.03      | 6.58023099398e-06 |
|   6    |      0.025      | 6.58023099398e-06 |
|   7    | 0.0285714285714 |  1.7298344606e-05 |
|   8    |     0.03125     |  1.895468674e-05  |
|   9    | 0.0277777777778 |  1.895468674e-05  |
|   10   |      0.025      |  1.895468674e-05  |
+--------+-----------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M3

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    |       0.0        |        0.0        |
|   2    |      0.025       | 1.65634213403e-06 |
|   3    | 0.0166666666667  | 1.65634213403e-06 |
|   4    |      0.0125      | 1.65634213403e-06 |
|   5    |       0.01       | 1.65634213403e-06 |
|   6    | 0.00833333333333 | 1.65634213403e-06 |
|   7    | 0.00714285714286 | 1.65634213403e-06 |
|   8    |     0.00625      | 1.65634213403e-06 |
|   9    | 0.0166666666667  | 1.05115104685e-05 |
|   10   |      0.025       | 4.86967611063e-05 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M4

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    |       0.0        |        0.0        |
|   2    |       0.0        |        0.0        |
|   3    |       0.0        |        0.0        |
|   4    |      0.0125      | 1.65634213403e-06 |
|   5    |       0.01       | 1.65634213403e-06 |
|   6    | 0.00833333333333 | 1.65634213403e-06 |
|   7    | 0.00714285714286 | 1.65634213403e-06 |
|   8    |     0.00625      | 1.65634213403e-06 |
|   9    | 0.00555555555556 | 1.65634213403e-06 |
|   10   |      0.005       | 1.65634213403e-06 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M5

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    |       0.0        |        0.0        |
|   2    |       0.0        |        0.0        |
|   3    | 0.0166666666667  | 1.65634213403e-06 |
|   4    |      0.0125      | 1.65634213403e-06 |
|   5    |       0.01       | 1.65634213403e-06 |
|   6    | 0.00833333333333 | 1.65634213403e-06 |
|   7    | 0.00714285714286 | 1.65634213403e-06 |
|   8    |     0.00625      | 1.65634213403e-06 |
|   9    | 0.00555555555556 | 1.65634213403e-06 |
|   10   |      0.005       | 1.65634213403e-06 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M6

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    |       0.0        |        0.0        |
|   2    |      0.025       | 3.26754672592e-06 |
|   3    | 0.0166666666667  | 3.26754672592e-06 |
|   4    |      0.0125      | 3.26754672592e-06 |
|   5    |       0.01       | 3.26754672592e-06 |
|   6    | 0.00833333333333 | 3.26754672592e-06 |
|   7    | 0.00714285714286 | 3.26754672592e-06 |
|   8    |     0.00625      | 3.26754672592e-06 |
|   9    | 0.00555555555556 | 3.26754672592e-06 |
|   10   |      0.005       | 3.26754672592e-06 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M7

Precision and recall summary statistics by cutoff
+--------+-----------------+-------------------+
| cutoff |  mean_precision |    mean_recall    |
+--------+-----------------+-------------------+
|   1    |       0.0       |        0.0        |
|   2    |       0.0       |        0.0        |
|   3    | 0.0333333333333 | 5.27051111321e-06 |
|   4    |      0.025      | 5.27051111321e-06 |
|   5    |       0.03      |  1.6428734724e-05 |
|   6    |      0.025      |  1.6428734724e-05 |
|   7    | 0.0214285714286 |  1.6428734724e-05 |
|   8    |     0.01875     |  1.6428734724e-05 |
|   9    | 0.0166666666667 |  1.6428734724e-05 |
|   10   |      0.015      |  1.6428734724e-05 |
+--------+-----------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M8

Precision and recall summary statistics by cutoff
+--------+-----------------+-------------------+
| cutoff |  mean_precision |    mean_recall    |
+--------+-----------------+-------------------+
|   1    |       0.05      | 7.82962730974e-06 |
|   2    |      0.025      | 7.82962730974e-06 |
|   3    | 0.0166666666667 | 7.82962730974e-06 |
|   4    |      0.025      | 1.12774276131e-05 |
|   5    |       0.03      | 1.70604486633e-05 |
|   6    | 0.0333333333333 | 1.90634130506e-05 |
|   7    | 0.0285714285714 | 1.90634130506e-05 |
|   8    |     0.04375     | 3.76827775482e-05 |
|   9    | 0.0388888888889 | 3.76827775482e-05 |
|   10   |       0.04      | 3.96857419355e-05 |
+--------+-----------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M9

Precision and recall summary statistics by cutoff
+--------+------------------+-------------------+
| cutoff |  mean_precision  |    mean_recall    |
+--------+------------------+-------------------+
|   1    |       0.0        |        0.0        |
|   2    |      0.025       | 7.82962730974e-06 |
|   3    | 0.0166666666667  | 7.82962730974e-06 |
|   4    |      0.0125      | 7.82962730974e-06 |
|   5    |       0.01       | 7.82962730974e-06 |
|   6    | 0.00833333333333 | 7.82962730974e-06 |
|   7    | 0.00714285714286 | 7.82962730974e-06 |
|   8    |     0.00625      | 7.82962730974e-06 |
|   9    | 0.0111111111111  | 1.36126483599e-05 |
|   10   |      0.015       | 1.93956694101e-05 |
+--------+------------------+-------------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

The recall values here are so small (on the order of 1e-5 to 1e-6) that recall rounds to 0 across the board, leaving the precision-recall curves effectively flat. Precision peaks at 0.1 (at cutoff 1), achieved by an item_similarity model (M0).

Below, we fit an item_similarity model on the full dataset to inspect its similar-actor results.
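Precision@k and recall@k can be computed directly from a ranked recommendation list and the set of held-out relevant items. Recall is tiny here because each "user" (a genre) has a huge number of relevant actors in the test set, so even a correct top-k list recovers a negligible fraction of them. A hand-rolled sketch (not GraphLab's evaluator) of that effect:

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of items; relevant: set of held-out items.
    Returns (precision@k, recall@k)."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / float(k), hits / float(len(relevant))

recs = ["Actor A", "Actor B", "Actor C"]
# one hit in the top 2, but 10,000 relevant items overall
relevant = set(["Actor B"] + ["other_%d" % i for i in range(9999)])
p, r = precision_recall_at_k(recs, relevant, k=2)
# p = 0.5, r = 1/10000: with large relevant sets, recall stays near zero
```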

In [162]:
actor_genre_split_rec = gl.recommender.item_similarity_recommender.create(actor_genre_split, 
                                  user_id="GenreSplit", 
                                  item_id="Actor", 
                                  target="rating",
                                  similarity_type='jaccard')

results = actor_genre_split_rec.get_similar_items(k=2)
results
Recsys training: model = item_similarity
Warning: Ignoring columns genres, movieId;
    To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
    Data has 1121936 observations with 20 users and 15458 items.
    Data prepared in: 2.25586s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 19.557ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 1.05s                               | 2.25             | 365             |
| 2.09s                               | 95.5             | 14772           |
| 9.66s                               | 100              | 15458           |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 9.93172s
Out[162]:
+------------------+----------------------+----------------+------+
|      Actor       |       similar        |     score      | rank |
+------------------+----------------------+----------------+------+
|  Robert De Niro  |      John Candy      | 0.666666686535 |  1   |
|  Robert De Niro  |      Tim Curry       | 0.666666686535 |  2   |
| Mary Steenburgen | Marguerite Churchill |      0.5       |  1   |
| Mary Steenburgen |  Megumi Hayashibara  |      0.5       |  2   |
|   Bruce Willis   |     Daniel Craig     | 0.736842095852 |  1   |
|   Bruce Willis   |   Charlize Theron    | 0.699999988079 |  2   |
|  Morgan Freeman  |    Milla Jovovich    | 0.611111104488 |  1   |
|  Morgan Freeman  |   Anthony Hopkins    | 0.611111104488 |  2   |
|   Kevin Spacey   |    Martin Freeman    | 0.666666686535 |  1   |
|   Kevin Spacey   |     Alec Baldwin     | 0.631578922272 |  2   |
+------------------+----------------------+----------------+------+
[30862 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

These results do not appear to perform as well as the models in which the genres were not split into separate rows.
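The Jaccard similarity underlying the item_similarity model above compares, for each pair of items (here, actors), the sets of users who interacted with them. A minimal sketch with made-up rater sets; the set contents are illustrative only:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))

# hypothetical sets of users who rated movies featuring each actor
raters_de_niro = {"u1", "u2", "u3", "u4"}
raters_candy = {"u2", "u3", "u4", "u5"}
sim = jaccard(raters_de_niro, raters_candy)
# overlap of 3 users out of 5 total -> similarity 0.6
```

Note that Jaccard similarity ignores the rating values themselves, which is one reason the warning above reports that the `rating` target is only used for scoring, not for measuring item similarity.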